Week 01
Introduction and Overview

SSPS4102 Data Analytics in the Social Sciences
SSPS6006 Data Analytics for Social Research


Semester 1, 2026
Last updated: 2026-02-20

Francesco Bailo

Acknowledgement of Country

I would like to acknowledge the Traditional Owners of Australia and recognise their continuing connection to land, water and culture. The University of Sydney is located on the land of the Gadigal people of the Eora Nation. I pay my respects to their Elders, past and present.

Note

These slides are based on:

  • Alexander, R. (2023). Telling Stories with Data: With Applications in R. CRC Press. TSwD
  • Gelman, A., Hill, J., & Vehtari, A. (2021). Regression and Other Stories. Cambridge University Press. ROS

Students are encouraged to refer to the relevant chapters for additional detail and examples.

What is Data Science?

Learning objectives

By the end of this seminar, you will be able to:

  1. Understand the role of data analysis in social science research
  2. Explain the three fundamental challenges of statistics
  3. Describe the Plan-Simulate-Acquire-Explore-Share workflow
  4. Understand data validity and reliability
  5. Set up an R and RStudio environment

What is data science?

A working definition

Data science is humans measuring things, typically related to other humans, and using sophisticated averaging to explain and predict.

This definition emphasises:

  • Data are generated by humans and about humans
  • The process of turning the world into data involves many decisions
  • Analysis involves explanation and prediction

If data science is the process of turning the world into measurable data, it involves many decisions…

Is data science “subjective” or “objective”?

Or maybe… “intersubjective”?

Data science in social research

Data science allows us to:

  1. Predict outcomes (elections, behaviour, health)
  2. Explore associations (risk factors, attitudes, characteristics)
  3. Extrapolate from samples to populations
  4. Make causal inferences about treatments and interventions

“In any case the key elements are the same… What is the dataset? Who generated it and why? What is missing?” (Alexander, 2023, p. 3) TSwD

The challenge of measurement

Even seemingly simple things are hard to measure:

  • Height: Changes during the day; tape vs laser gives different results
  • Happiness: How do we quantify subjective feelings?
  • Income: Before or after tax? Per person or household?

Illustration of simplified representation

Picasso’s single-line dog drawing

What is the minimum we need to capture the essence?

A1 In-Class Exercise

📋 Head to Canvas and complete Section 1: What is Data Science? of this week's A1 in-class exercise (5 minutes)

The Three Challenges of Statistics

The three challenges

Fundamental challenges of statistical inference

  1. Generalising from sample to population
  2. Generalising from treatment to control group
  3. Generalising from observed measurements to underlying constructs

These challenges arise in nearly every application of data analysis!

Challenge 1: Sample to population

The problem: We usually only observe a sample of the population we care about.

Examples:

  • Surveys don’t reach everyone
  • Not everyone responds
  • Some groups are harder to reach

Selection bias: The people we observe may differ systematically from those we don’t.

Who is systematically missing from our data?
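Selection bias can be made concrete with a small simulation. The sketch below uses an invented population (not real survey data) in which younger people are assumed to be more likely to respond and to spend more hours online. All names and parameter values are assumptions chosen for illustration.

```r
set.seed(4102)  # arbitrary seed so the sketch is repeatable

# Hypothetical population of 100,000 adults (all values are invented)
n_pop <- 100000
age <- runif(n_pop, min = 18, max = 90)

# Assumed outcome: hours online per day, decreasing with age plus noise
hours_online <- 6 - 0.05 * age + rnorm(n_pop, mean = 0, sd = 1)

# Assumed response mechanism: younger people are more likely to respond
prob_respond <- plogis(2 - 0.06 * age)
responded <- rbinom(n_pop, size = 1, prob = prob_respond) == 1

# Compare the population mean with the mean among respondents only
population_mean <- mean(hours_online)
respondent_mean <- mean(hours_online[responded])

# A simple random sample of the same size, for comparison
random_sample_mean <- mean(sample(hours_online, size = sum(responded)))

round(c(
  population = population_mean,
  respondents = respondent_mean,
  random_sample = random_sample_mean
), 2)
```

Under these assumptions the respondents overstate the population average, while a random sample of the same size lands close to it. Sample size does not fix the problem: who ends up in the data matters.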

Challenge 2: Treatment to control

The problem: We want to know what would have happened if we had made a different choice.

In experiments:

  • Randomly assign treatment
  • Compare outcomes
  • Estimate causal effect

In observational studies:

  • Treatment is not randomly assigned
  • Groups may differ before treatment
  • Must adjust for differences

Correlation ≠ Causation

Just because two things are associated doesn’t mean one causes the other!
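A minimal simulated sketch of why association is not causation in observational data. Here an invented confounder (temperature) drives two variables that have no effect on each other. The variable names and coefficients are assumptions for illustration only.

```r
set.seed(6006)  # arbitrary seed so the sketch is repeatable

n <- 5000

# Hypothetical confounder: daily temperature (standardised)
temperature <- rnorm(n)

# Both variables depend on temperature, but not on each other
ice_cream_sales <- 2 * temperature + rnorm(n)
sunburn_cases <- 3 * temperature + rnorm(n)

# The raw association is strong even though neither causes the other
raw_correlation <- cor(ice_cream_sales, sunburn_cases)

# Regression without and with adjustment for the confounder
unadjusted <- lm(sunburn_cases ~ ice_cream_sales)
adjusted <- lm(sunburn_cases ~ ice_cream_sales + temperature)

raw_correlation
coef(unadjusted)["ice_cream_sales"]  # clearly non-zero
coef(adjusted)["ice_cream_sales"]    # near zero once temperature is included
```

Adjustment works here only because the confounder was measured and the model matches how the data were generated. With real observational data we rarely know that we have adjusted for everything that matters.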

Challenge 3: Measurement to constructs

The problem: What we measure is rarely what we actually want to know.

What we measure

  • Survey responses
  • Test scores
  • Administrative records
  • Social media posts

What we want to know

  • True opinions
  • Actual ability
  • Real behaviour
  • Population sentiment

“Most of the time our data do not record exactly what we would ideally like to study.” (Gelman et al., 2021, p. 3) ROS

Example: The Human Development Index

The HDI claims to measure “human development” using:

  • Life expectancy
  • Education (literacy + enrolment)
  • Standard of living (GDP per capita)

But: Most variation between US states comes from income, not health or education!

The lesson: Always examine where your numbers come from.

What does your measure actually capture?

A1 In-Class Exercise

📋 Head to Canvas and complete Section 2: The Three Challenges of this week's A1 in-class exercise (5 minutes)

Cogniti: Your AI Coaches for This Unit

We use Cogniti chatbots throughout the semester — access them all from Canvas.

There are two types:

🤖 R Coach

Helps you with R programming:

  • Debugging your code
  • Putting together the right code for your analysis
  • Available throughout the semester

📚 Weekly Coach

Has the assigned readings for that week loaded in its context:

  • Use it for targeted questions about weekly content
  • If your question is from another week, it will point you to the right weekly coach

How to access

Go to Canvas → find the Cogniti chatbots for this unit

Data Quality: Validity and Reliability

Validity

Definition

A measure is valid to the degree that it represents what you are trying to measure.

Examples of validity problems:

  • A written test as a measure of musical ability ❌
  • Customer satisfaction surveys as a measure of service effectiveness ❓
  • Blood pressure readings as a measure of cardiovascular health ✓

Key question: Is there general agreement that the observations are closely related to the intended construct?

Reliability

Definition

A reliable measure is one that is precise and stable—if we measure again, we get similar values.

Ways to assess reliability:

  • Test-retest: Give the same test twice
  • Inter-rater: Have different people make the same measurement
  • Internal consistency: Do related items give similar results?

The key insight

Variability in our data should reflect real differences, not measurement error.
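Test-retest reliability can be sketched with simulated scores. The example below assumes each observed score is a true score plus random error and compares a precise test with a noisy one. The helper administer_test() and all numbers are hypothetical.

```r
set.seed(2026)  # arbitrary seed so the sketch is repeatable

n_people <- 500

# Hypothetical "true" scores that we are trying to measure
true_score <- rnorm(n_people, mean = 50, sd = 10)

# Hypothetical helper: one administration of a test with a given error SD
administer_test <- function(true_score, error_sd) {
  true_score + rnorm(length(true_score), mean = 0, sd = error_sd)
}

# Two administrations of a precise test and of a noisy test
precise_1 <- administer_test(true_score, error_sd = 2)
precise_2 <- administer_test(true_score, error_sd = 2)
noisy_1 <- administer_test(true_score, error_sd = 15)
noisy_2 <- administer_test(true_score, error_sd = 15)

# Test-retest reliability as the correlation between the two administrations
reliability_precise <- cor(precise_1, precise_2)
reliability_noisy <- cor(noisy_1, noisy_2)

round(c(precise = reliability_precise, noisy = reliability_noisy), 2)
```

With the noisy test, much of the variability between people reflects measurement error rather than real differences, so the two administrations agree poorly.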

Validity vs reliability

High validity, low reliability

  • Measuring the right thing
  • But inconsistently
  • Example: Accurate but shaky scale

High reliability, low validity

  • Consistent measurements
  • But of the wrong thing
  • Example: Precisely measuring height when you want to know weight

We need both!

A measure can be reliable without being valid, but a valid measure must be reasonably reliable.
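The contrast between the two cases can be sketched with repeated readings of one hypothetical person's weight. Note the substitution: instead of the height example, the second instrument is a consistently miscalibrated scale, used as a simple stand-in for a measure that is stable but does not capture the intended quantity. All values are invented.

```r
set.seed(1)  # arbitrary seed so the sketch is repeatable

true_weight_kg <- 70  # hypothetical true value for one person
n_readings <- 200

# Instrument A: right on average, but shaky (large random error)
shaky_scale <- rnorm(n_readings, mean = true_weight_kg, sd = 5)

# Instrument B: very consistent, but systematically 8 kg too high
# (an assumed stand-in for "measuring the wrong thing")
miscalibrated_scale <- rnorm(n_readings, mean = true_weight_kg + 8, sd = 0.2)

# Summarise each instrument: average error (bias) and spread of readings
summarise_instrument <- function(readings, truth) {
  c(bias = mean(readings) - truth, spread = sd(readings))
}

results <- rbind(
  shaky_scale = summarise_instrument(shaky_scale, true_weight_kg),
  miscalibrated_scale = summarise_instrument(miscalibrated_scale, true_weight_kg)
)
round(results, 2)
```

Treating bias as a proxy for validity is a simplification. In practice validity is judged against the construct, which usually cannot be observed directly the way true_weight_kg is here.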

A1 In-Class Exercise

📋 Head to Canvas and complete Section 3: Validity and Reliability of this week's A1 in-class exercise (5 minutes)

The Data Science Workflow

The five-step workflow

Plan → Simulate → Acquire → Explore → Share

This workflow guides everything we do in this course.

Step 1: Plan

Why plan first?

“In Alice’s Adventures in Wonderland, Alice asks the Cheshire Cat which way she should go. The Cat replies that it depends on where Alice wants to get to.”

Planning involves:

  • Sketching the endpoint (what graph/table do you want?)
  • Identifying the data you need
  • Considering who is affected by your analysis

Practical tip

Ten minutes with paper and pen is often enough to get started!

Step 2: Simulate

Why simulate data?

For data cleaning:

  • Forces you to think about data types
  • Helps define expected values
  • Creates tests for your real data

For modelling:

  • Know the “truth” in advance
  • Test if your model recovers it
  • Build confidence before real data

“Simulation is often cheap—almost free given modern computing resources—and fast.” (Alexander, 2023, p. 5) TSwD
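A minimal sketch of the modelling use of simulation: build data from a known linear relationship, fit the model, and check whether the estimates come back close to the values that were built in. The intercept, slope, and noise level below are arbitrary assumptions.

```r
set.seed(853)  # arbitrary seed so the sketch is repeatable

n <- 1000
true_intercept <- 2
true_slope <- 0.5

# Simulate a predictor and an outcome from a known linear model
x <- rnorm(n, mean = 10, sd = 3)
y <- true_intercept + true_slope * x + rnorm(n, mean = 0, sd = 1)

# Fit the model we plan to use on the real data
fit <- lm(y ~ x)

# Compare the estimates with the values we built in
estimates <- coef(fit)
intervals <- confint(fit, level = 0.95)

cbind(truth = c(true_intercept, true_slope), estimate = estimates, intervals)
```

If the estimates were far from the truth, that would point to a problem in the code or the model before any real data are involved.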

Step 3: Acquire

Data acquisition is often overlooked but critical!

Key considerations:

  • Where does the data come from?
  • Who collected it and why?
  • What’s missing or poorly measured?
  • What decisions have already been made?

Data never “speak for themselves”

They are shaped by the choices of those who collected and prepared them.

Step 4: Explore

Exploratory Data Analysis (EDA) involves:

  • Summary statistics
  • Graphs and tables
  • Initial modelling
  • Understanding the “shape” of your data

This is an iterative process that continues throughout your project.

“It is difficult to delineate where EDA ends and formal statistical modelling begins.”

Step 5: Share

Communication is the most important element

Simple analysis, communicated well, is more valuable than complicated analysis communicated poorly.

Clear communication means:

  • Writing in plain language
  • Using appropriate tables and graphs
  • Explaining decisions and limitations
  • Making your work reproducible

Key elements of telling stories with data

  1. Communication — Clear, audience-focused writing
  2. Reproducibility — Others can redo your work
  3. Ethics — Considering who is affected
  4. Questions — Curiosity drives good research
  5. Measurement — Understanding what data capture

How assessments map to these elements

A3 A4
1. Communication
2. Reproducibility
3. Ethics (optional)
4. Questions (a single question will do!)
5. Measurement

Getting Started with R

Why R?

Advantages:

  • Free and open source
  • Huge community
  • Excellent for statistics
  • Great visualisation
  • Reproducible documents

For this course:

  • Standard in social science
  • Well-documented
  • Active package development
  • Integrates with Quarto

Open RStudio on Your Laptop!

💻 Find RStudio on your laptop and open it now


No laptop? You can use Posit Cloud — create a free account at posit.cloud


A word of caution about the Posit Cloud free tier…

The free tier comes with limited hours — you could find yourself locked out halfway through a project

Also, it doesn’t work on a plane ✈️🚫 — and you will want to work in RStudio on a plane at some point

The RStudio interface

Four main panes:

  1. Source — Write and edit code
  2. Console — Run commands
  3. Environment — View objects
  4. Files/Plots/Help — Navigate and view outputs

Key shortcuts:

  • Ctrl/Cmd + Enter — Run current line
  • Ctrl/Cmd + Shift + Enter — Run chunk
  • Tab — Autocomplete
  • Ctrl/Cmd + S — Save

Your first R commands

# This is a comment - R ignores it
# Basic arithmetic
1 + 1
[1] 2
# Creating objects with the assignment operator
my_number <- 42
my_number
[1] 42
# Using functions
sqrt(my_number)
[1] 6.480741

Installing and loading packages

Packages extend R’s functionality.

# Install a package (only need to do once)
install.packages("tidyverse")

# Load a package (need to do each session)
library(tidyverse)

Key packages for this course

  • tidyverse: Data manipulation and visualisation
  • janitor: Data cleaning utilities
  • knitr: Document generation
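One possible way to install only the packages that are missing, sketched in base R. The helper find_missing_packages() is hypothetical (it is not part of any package), and the install step is deliberately limited to interactive sessions so that rendering a document never triggers a download.

```r
# Packages named on this slide
course_packages <- c("tidyverse", "janitor", "knitr")

# Hypothetical helper: return the packages that are not yet installed
find_missing_packages <- function(packages) {
  installed <- vapply(
    packages,
    requireNamespace,
    FUN.VALUE = logical(1),
    quietly = TRUE
  )
  packages[!installed]
}

missing_packages <- find_missing_packages(course_packages)

# Install only what is missing (needs an internet connection);
# skipped when the code runs non-interactively, e.g. while rendering
if (interactive() && length(missing_packages) > 0) {
  install.packages(missing_packages)
}
```

Loading still happens separately with library() at the start of each session.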

Creating a Quarto document

Quarto combines text and code for reproducible research.

  1. File → New File → Quarto Document
  2. Add a title and your name
  3. Click “Create”

Source → Render → Output

When a Quarto (.qmd) document is open, RStudio lets you switch between the “Source” editor, which shows the raw text and code, and the “Visual” editor, which approximates what the document will look like once it is rendered.

Key elements:

  • YAML header — Document settings
  • Markdown — Formatted text
  • Code chunks — Executable R code

Code chunks

Code chunks contain R code that will be executed:

```{r}
# Your R code goes here
mean(c(1, 2, 3, 4, 5))
```

Produces:

mean(c(1, 2, 3, 4, 5))
[1] 3

Chunk options

Control how chunks behave:

```{r}
#| echo: false    # Don't show the code
#| eval: true     # Do run the code
#| message: false # Hide messages
#| warning: false # Hide warnings
```

A1 In-Class Exercise

📋 Head to Canvas and complete Section 4: First R Commands of this week's A1 in-class exercise, then submit! (5 minutes)

Worked Example: Australian Elections

Putting it all together

Let’s walk through the complete workflow with a real example:

Question: How many seats did each party win in the 2022 Australian Federal Election?

The workflow

  1. Plan: Sketch the data and graph we need
  2. Simulate: Create fake data to test our approach
  3. Acquire: Get the real data
  4. Explore: Analyse and visualise
  5. Share: Communicate our findings

Step 1: Plan

Data we need:

Division Party
Adelaide Labor
Aston Liberal

Graph we want:

A bar chart showing the number of seats won by each party.

Step 2: Simulate

# Create simulated data
simulated_data <- tibble(
  division = 1:151,  # 151 seats in the House
  party = sample(
    x = c("Liberal", "Labor", "Nationals", "Greens", "Other"),
    size = 151,
    replace = TRUE
  )
)

# Check the first few rows
head(simulated_data)
# A tibble: 6 × 2
  division party  
     <int> <chr>  
1        1 Greens 
2        2 Liberal
3        3 Greens 
4        4 Labor  
5        5 Greens 
6        6 Liberal
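Simulated data are also a convenient place to write checks that can later be run on the real data. A minimal sketch, assuming the same structure as the simulation above. The helper check_elections_data() is hypothetical, and the seed is arbitrary, so the simulated rows will differ from the output shown.

```r
library(tidyverse)

set.seed(853)  # arbitrary seed so the simulated draw is repeatable

parties <- c("Liberal", "Labor", "Nationals", "Greens", "Other")

simulated_data <- tibble(
  division = 1:151,
  party = sample(x = parties, size = 151, replace = TRUE)
)

# Hypothetical helper: stop with an error if a dataset breaks our expectations
check_elections_data <- function(data) {
  stopifnot(
    "expected 151 rows, one per seat" = nrow(data) == 151,
    "each division should appear once" = !any(duplicated(data$division)),
    "unexpected party label" = all(data$party %in% parties),
    "party should not be missing" = !anyNA(data$party)
  )
  invisible(TRUE)
}

check_elections_data(simulated_data)
```

The same function can be pointed at the cleaned real dataset, provided its columns are named division and party (or the checks are adapted to the names actually used).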

Step 3: Acquire

# Read data from the Australian Electoral Commission
raw_elections_data <- read_csv(
  file = paste0(
    "https://results.aec.gov.au/27966/website/Downloads/",
    "HouseMembersElectedDownload-27966.csv"
  ),
  skip = 1
)
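The Explore step uses an object called full_data with a column elected_party, but the cleaning step that creates it is not shown on these slides. The sketch below is one plausible version, not necessarily the one used to produce the results. It assumes the AEC file has columns named DivisionNm and PartyNm and uses the party labels listed in the code; check both against the downloaded file. To run offline, it works on a tiny hand-typed stand-in rather than raw_elections_data.

```r
library(tidyverse)

# Tiny hand-typed stand-in for raw_elections_data (assumed column names and
# party labels; check them against the file you actually downloaded)
raw_example <- tibble(
  DivisionNm = c("Adelaide", "Aston", "Clark"),
  PartyNm = c("Australian Labor Party", "Liberal", "Independent")
)

# Keep the two columns we planned for and shorten the party names
full_data_example <- raw_example |>
  select(division = DivisionNm, elected_party = PartyNm) |>
  mutate(
    elected_party = case_when(
      elected_party == "Australian Labor Party" ~ "Labor",
      elected_party %in% c("Liberal",
                           "Liberal National Party of Queensland") ~ "Liberal",
      elected_party == "The Nationals" ~ "Nationals",
      elected_party == "The Greens" ~ "Greens",
      TRUE ~ "Other"  # everything else, including independents
    )
  )

full_data_example
```

The catch-all "Other" category is convenient, but it would also silently absorb a misspelt party label, so it is worth counting the original labels before recoding.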

Step 4: Explore — Counting seats

# Count seats by party
full_data |>
  count(elected_party) |>
  arrange(desc(n))
# A tibble: 5 × 2
  elected_party     n
  <chr>         <int>
1 Labor            77
2 Liberal          48
3 Other            12
4 Nationals        10
5 Greens            4

Step 4: Explore — Creating a graph

full_data |>
  ggplot(aes(x = elected_party)) +
  geom_bar(fill = "#00539b") +
  theme_minimal(base_size = 16) +
  labs(
    x = "Party",
    y = "Number of seats",
    title = "2022 Australian Federal Election Results"
  )

Step 5: Share

Australia is a parliamentary democracy with 151 seats in the House of Representatives. The 2022 Federal Election saw the Labor Party win 77 seats, followed by the Liberal Party with 48 seats.

Key findings:

  • Labor won a majority (77 of 151 seats)
  • The two major parties dominate, holding 125 of 151 seats between them
  • All other parties and independents (Nationals, Greens, Other) won 26 seats combined
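The figures in these findings can be checked directly from the seat counts in the Explore step. A short sketch in base R, where the only inputs are the five counts from that table:

```r
# Seat counts copied from the table produced in the Explore step
seats <- c(Labor = 77, Liberal = 48, Other = 12, Nationals = 10, Greens = 4)

total_seats <- sum(seats)  # 151

# Smallest number of seats that is more than half of the House
majority_threshold <- floor(total_seats / 2) + 1  # 76

# Does Labor reach the threshold on its own?
labor_has_majority <- seats[["Labor"]] >= majority_threshold  # TRUE

# Seats held by everyone other than the two largest parties
seats_outside_top_two <- total_seats - seats[["Labor"]] - seats[["Liberal"]]  # 26

# Share of seats per party, as a percentage rounded to one decimal place
seat_share <- round(100 * seats / total_seats, 1)
seat_share
```

Recomputing headline numbers from the underlying table like this is a cheap way to catch transcription errors before sharing.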

Wrap-up

This week’s readings

Telling Stories with Data:

  • Ch 1: Telling stories with data
  • Ch 2: Drinking from a fire hose (§2.1-2.2)

Regression and Other Stories:

  • Ch 1: Overview
  • Ch 2: Data and Measurement

Reading strategy

Focus on the concepts first. The technical details will make more sense as we practise.

Key takeaways

  1. Data science is about humans measuring things to explain and predict
  2. The three challenges of statistics pervade all analysis
  3. Validity and reliability are fundamental to good measurement
  4. The Plan-Simulate-Acquire-Explore-Share workflow guides our work
  5. R is a powerful tool for reproducible data analysis

Next week

Week 2: Reproducible Workflows and Version Control

  • Creating reproducible documents with Quarto
  • Organising projects with R Projects
  • Introduction to Git and GitHub

Before next week

  • Complete and submit the A1 for Week 01 if you haven’t already (deadline is always Saturday at 23:59)
  • Complete and submit the A2 for Week 01 (deadline is always Saturday at 23:59)
  • Complete the readings for Week 02
  • Go through the “Before Class” section of the Problem Set for Week 02 (if we have time we can see an example now…)

Questions?

Attendance!

References

Alexander, R. (2023). Telling Stories with Data: With Applications in R. CRC Press.
Gelman, A., Hill, J., & Vehtari, A. (2021). Regression and Other Stories. Cambridge University Press.